Abstract

The AMP Markov property is a recently proposed alternative Markov property 
for chain graphs. In the case of continuous variables with a joint multivariate Gaus- 
sian distribution, it is the AMP rather than the earlier introduced LWF Markov 
property that is coherent with data-generation by natural block-recursive regres- 
sions. In this paper, we show that maximum likelihood estimates in Gaussian AMP 
chain graph models can be obtained by combining generalized least squares and 
iterative proportional fitting into an iterative algorithm. In an appendix, we give
useful convergence results for iterative partial maximization algorithms that apply 
in particular to the described algorithm. 

Key words: AMP chain graph, graphical model, iterative partial maximization, multi- 
variate normal distribution, maximum likelihood estimation 

1 Introduction 

In graphical modelling, graphs are used to describe patterns of conditional indepen- 
dence. Undirected graphs encode the conditional independences underlying Markov ran- 
dom fields, and acyclic directed graphs encode the conditional independences underlying 
Bayesian networks. A generalization of both Markov random fields and Bayesian networks
is provided by chain graphs, which were introduced with the Markov/conditional
independence interpretation described in Lauritzen & Wermuth (1989), Wermuth & Lauritzen
(1990) and Frydenberg (1990); see also Lauritzen (1996, §5.4.1) and Edwards (2000, §7.2).
Graphical models jargon refers to the models induced by this Markov interpretation as
LWF chain graph models. Recently, however, Andersson et al. (2001) have proposed an
alternative Markov property (AMP) for chain graphs (see also Levitz et al., 2001;
Andersson & Perlman, 2004). In the case of continuous variables with a joint multivariate
Gaussian (normal) distribution, it is their AMP rather than the LWF Markov property that
is coherent with data-generation by natural block-recursive regressions
(Andersson et al., 2001, §§1 and 5).

Statistical inference for LWF chain graph models is well developed, but this is not the 
case for AMP chain graph models. This paper considers maximum likelihood (ML) estimation
in Gaussian AMP chain graph models. After reviewing these models in section 2, we derive,
in section 3, the likelihood equations and the Fisher-information. Combining generalized
least squares and iterative proportional fitting, we describe an iterative algorithm for
solving the likelihood equations, which yields consistent and asymptotically efficient
estimates. The convergence properties of this algorithm can be derived from convergence
results for iterative partial maximization algorithms that are given in the Appendix. An
application to university graduation data in section 4 illustrates AMP chain graph
modelling. We conclude with the discussion in section 5.



2 Gaussian AMP chain graph models 



Let $G = (V, E)$ be a mixed graph with finite vertex set $V$ and an edge set $E$ that may
contain two types of edges, namely directed ($u \to v$) and undirected ($u - v$) edges.
The graph $G$ is called a chain graph if it does not contain any semi-directed cycles, that
is, it contains no path from $v$ to $v$ with at least one directed edge such that all directed
edges have the same orientation. The vertex set of a chain graph can be partitioned
into subsets $\tau \in \mathcal{T}$ such that all edges within each subset $\tau$ are undirected and edges
between two different subsets $\tau \neq \tau'$ are directed. In the following, we assume that the
partition $\mathcal{T}$ is maximal, that is, any two vertices in a subset $\tau$ are connected by an
undirected path. Then the subsets $\tau \in \mathcal{T}$ are unique and called the chain components
of the graph $G$; compare figure 1 in section 4.
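The chain components are the connected components of the graph that remains once all directed edges are removed. A minimal sketch of this computation follows; the function name `chain_components` and the toy graph are illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict

def chain_components(vertices, undirected_edges):
    """Connected components of the undirected part of a chain graph."""
    neighbours = defaultdict(set)
    for u, v in undirected_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:  # depth-first search along undirected edges only
            u = stack.pop()
            if u not in component:
                component.add(u)
                stack.extend(neighbours[u] - component)
        seen |= component
        components.append(frozenset(component))
    return components

# Hypothetical toy chain graph: a - b, c - d, a -> c, b -> d, d -> e.
vertices = ["a", "b", "c", "d", "e"]
undirected = [("a", "b"), ("c", "d")]
components = chain_components(vertices, undirected)
```

Directed edges play no role in this step; they only determine the ordering of the components.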

For a given chain graph $G$, we consider the class $\mathcal{P}(G)$ of normal distributions
$\mathcal{N}(0, \Sigma)$ on $\mathbb{R}^V$ with positive definite covariance matrix $\Sigma$ that satisfy the AMP
Markov property (Andersson et al., 2001, §4) with respect to $G$. Andersson et al. (2001, §5)
described a parameterization of $\mathcal{P}(G)$ that associates one parameter with each vertex
in $V$ and each edge in $E$. More precisely, let $\Omega = (\Omega_{uv}) \in \mathbb{R}^{V \times V}$ be a positive
definite matrix such that for any distinct vertices $u$ and $v$

$$u - v \notin E \;\Longrightarrow\; \Omega_{uv} = 0, \tag{1}$$

and let $B = (B_{uv})$ be an arbitrary matrix in $\mathbb{R}^{V \times V}$ such that for any vertices $u$ and $v$

$$u \to v \notin E \;\Longrightarrow\; B_{vu} = 0. \tag{2}$$

For two such matrices $\Omega$ and $B$, we set

$$\Sigma(B, \Omega) = (I_V - B)^{-1}\, \Omega^{-1}\, (I_V - B')^{-1}, \tag{3}$$

where $I_V \in \mathbb{R}^{V \times V}$ denotes the identity matrix. A normal distribution $\mathcal{N}(0, \Sigma)$ with
$\Sigma > 0$ satisfies the AMP Markov property if and only if there exist $B$ and $\Omega$ such that
(1) and (2) hold and $\Sigma = \Sigma(B, \Omega)$.
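As an illustration of (3), the following sketch builds $\Sigma(B, \Omega)$ for a small chain graph and checks that its inverse equals $(I_V - B')\,\Omega\,(I_V - B)$. The three-vertex graph ($0 \to 1$, $1 - 2$) and all numerical values are assumptions chosen for the example.

```python
import numpy as np

def amp_covariance(B, Omega):
    """Sigma(B, Omega) = (I - B)^{-1} Omega^{-1} (I - B')^{-1}, as in equation (3)."""
    A = np.linalg.inv(np.eye(B.shape[0]) - B)
    return A @ np.linalg.inv(Omega) @ A.T

# Hypothetical chain graph on vertices 0, 1, 2 with edges 0 -> 1 and 1 - 2.
B = np.zeros((3, 3))
B[1, 0] = 0.5                      # row = child, column = parent, see (2)
Omega = np.array([[1.0, 0.0, 0.0],
                  [0.0, 2.0, -0.5],
                  [0.0, -0.5, 2.0]])  # zero wherever there is no undirected edge, see (1)

Sigma = amp_covariance(B, Omega)
I3 = np.eye(3)
precision = (I3 - B.T) @ Omega @ (I3 - B)
```

Here `precision` is the inverse covariance matrix written directly in terms of the parameters, which avoids one matrix inversion.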

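For a vertex $v \in V$, let $\mathrm{pa}(v) = \{u \in V \mid u \to v \in E\}$ be the set of parents of $v$.
Furthermore, we set $\mathrm{pa}(\tau) = \bigcup_{v \in \tau} \mathrm{pa}(v)$. Because of the nonexistence of semi-directed
cycles, the joint distribution of $X_V$ can be factorized as

$$f(x_V) = \prod_{\tau \in \mathcal{T}} f(x_\tau \mid x_{\mathrm{pa}(\tau)}), \qquad x_V \in \mathbb{R}^V. \tag{4}$$

For $\tau \in \mathcal{T}$ the conditional distribution $f(x_\tau \mid x_{\mathrm{pa}(\tau)})$ is given by

$$(X_\tau \mid x_{\mathrm{pa}(\tau)}) \sim \mathcal{N}(B_\tau\, x_{\mathrm{pa}(\tau)},\, \Omega_\tau^{-1}), \tag{5}$$

where $B_\tau = (B_{uv})_{u \in \tau, v \in \mathrm{pa}(\tau)}$ and $\Omega_\tau = (\Omega_{uv})_{u,v \in \tau}$ are submatrices of $B$ and $\Omega$,
respectively. The conditional distribution corresponds to a block-regression, in which the
block of variables $X_\tau$ is regressed on the parents $X_{\mathrm{pa}(\tau)}$.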

The parameter $(B_\tau, \Omega_\tau)$ can be rewritten in vectorized form. Let
$\beta_\tau = (B_{uv} \mid u \in \tau,\, v \in \mathrm{pa}(u))$ be the vector of unconstrained elements in $B_\tau$.
Subsequently, we write $B_\tau(\beta_\tau)$ for the matrix defined by $\beta_\tau$ and (2). Similarly, let
$\omega_\tau$ be the vector of elements $\Omega_{uv}$ in $\Omega_\tau$ such that either $u = v$, or $u < v$ and
$u - v \in E$. Furthermore, denote the dimension of $\beta_\tau$ and $\omega_\tau$ by $p_\tau$ and $q_\tau$, respectively.
Then the parameter space for the parameter $(\beta_\tau, \omega_\tau)$ is

$$\Theta_\tau = \{(\beta_\tau, \omega_\tau) \in \mathbb{R}^{p_\tau} \times \mathbb{R}^{q_\tau} \mid \Omega_\tau(\omega_\tau) > 0\}, \tag{6}$$

where $\Omega_\tau(\omega_\tau) \in \mathbb{R}^{\tau \times \tau}$ is the matrix defined by $\omega_\tau$ and (1). It follows from (4) and
(5) that $\theta = (\beta_\tau, \omega_\tau)_{\tau \in \mathcal{T}}$ parameterizes $\mathcal{P}(G)$. Equation (15) below clarifies that $\theta$ is
identifiable. The parameter space of $\mathcal{P}(G)$ is the Cartesian product $\Theta = \times_{\tau \in \mathcal{T}} \Theta_\tau$. This
factorization of the parameter space together with the factorization of the joint density
implies that the ML estimator (MLE) of the joint parameter $\theta$ can be obtained by
computing, separately for every $\tau \in \mathcal{T}$, the MLE of $(\beta_\tau, \omega_\tau)$ in the block-regression (5).
Furthermore, the Hessian of the likelihood function of the model $\mathcal{P}(G)$ is block-diagonal
with one block for each one of the block-regressions indexed by $\tau \in \mathcal{T}$.
3 Maximum likelihood estimation



3.1 Likelihood equations

Let $X = (X_{v,i})_{v \in V, i \in N} \in \mathbb{R}^{V \times N}$ now be a data matrix whose column vectors, indexed
by the set $N$, are independent and identically distributed according to some $P \in \mathcal{P}(G)$.
Since, merely for notational convenience, the distributions in $\mathcal{P}(G)$ are assumed to be
centered, the sample covariance matrix is defined as

$$S = \frac{1}{n} X X',$$

where $n = |N|$ is the sample size. We assume that

$$n \ge \max_{\tau \in \mathcal{T}} \bigl(|\tau| + |\mathrm{pa}(\tau)|\bigr)$$

such that, with probability one, the submatrices $S_{\tau,\tau}$, $S_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)}$, and the matrix
$S(\beta_\tau)$ defined below are of full rank. This ensures that the MLE exists in each one of the
block-regressions. Dividing by $n$ and ignoring the additive constant $-(|\tau|/2) \log(2\pi)$,
the log-likelihood function for the block-regression (5) is given by

$$\ell_n(\beta_\tau, \omega_\tau) = \frac{1}{2} \log |\Omega_\tau(\omega_\tau)| - \frac{1}{2} \mathrm{tr}\bigl[\Omega_\tau(\omega_\tau)\, S(\beta_\tau)\bigr], \tag{7}$$

where

$$S(\beta_\tau) = \frac{1}{n} \bigl[X_\tau - B_\tau(\beta_\tau) X_{\mathrm{pa}(\tau)}\bigr] \bigl[X_\tau - B_\tau(\beta_\tau) X_{\mathrm{pa}(\tau)}\bigr]'$$

$$\phantom{S(\beta_\tau)} = S_{\tau,\tau} - B_\tau(\beta_\tau) S_{\mathrm{pa}(\tau),\tau} - S_{\tau,\mathrm{pa}(\tau)} B_\tau(\beta_\tau)' + B_\tau(\beta_\tau) S_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)} B_\tau(\beta_\tau)'$$

is the sample covariance matrix of the residuals in the block-regression (5), and
$X_A \in \mathbb{R}^{A \times N}$ denotes the submatrix of $X$ that comprises all rows with index in $A$.
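The residual covariance $S(\beta_\tau)$ and the log-likelihood (7) can be evaluated from the blocks of $S$ alone. The sketch below assumes NumPy arrays for $S_{\tau,\tau}$, $S_{\tau,\mathrm{pa}(\tau)}$ and $S_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)}$; the function names and the random test data are illustrative assumptions.

```python
import numpy as np

def residual_covariance(S_tt, S_tp, S_pp, B):
    """S(beta) = S_tt - B S_pt - S_tp B' + B S_pp B', with S_pt = S_tp'."""
    return S_tt - B @ S_tp.T - S_tp @ B.T + B @ S_pp @ B.T

def block_loglik(S_tt, S_tp, S_pp, B, Omega):
    """Log-likelihood (7) of one block-regression, up to the additive constant."""
    _, logdet = np.linalg.slogdet(Omega)
    return 0.5 * logdet - 0.5 * np.trace(Omega @ residual_covariance(S_tt, S_tp, S_pp, B))

# Hypothetical data: 3 response variables, 2 parents, n = 50 observations (columns).
rng = np.random.default_rng(0)
n = 50
X_p = rng.standard_normal((2, n))
X_t = rng.standard_normal((3, n))
S_tt, S_tp, S_pp = X_t @ X_t.T / n, X_t @ X_p.T / n, X_p @ X_p.T / n
B = rng.standard_normal((3, 2))
Omega = np.eye(3)

residuals = X_t - B @ X_p
direct = residuals @ residuals.T / n   # residual covariance computed from the raw data
value = block_loglik(S_tt, S_tp, S_pp, B, Omega)
```

Both ways of computing the residual covariance agree, so the raw data are not needed once $S$ is available.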

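Let $P_\tau = \partial\, \mathrm{vec}(B_\tau) / \partial \beta_\tau'$ and $Q_\tau = \partial\, \mathrm{vec}(\Omega_\tau) / \partial \omega_\tau'$. Both $P_\tau$ and $Q_\tau$ have entries
in $\{0, 1\}$ and satisfy $\mathrm{vec}(B_\tau) = P_\tau \beta_\tau$ and $\mathrm{vec}(\Omega_\tau) = Q_\tau \omega_\tau$, respectively. Each column
in $P_\tau$ has exactly one entry equal to one. A column in $Q_\tau$ has exactly one or exactly
two entries equal to one depending on whether the associated element in $\omega_\tau$ comes from
the diagonal or the off-diagonal part of $\Omega_\tau$, respectively. With these two matrices the
likelihood equations obtained by taking first derivatives with respect to $\beta_\tau$ and $\omega_\tau$ can
be written as

$$P_\tau' \bigl[\mathrm{vec}(\Omega_\tau S_{\tau,\mathrm{pa}(\tau)}) - (S_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)} \otimes \Omega_\tau)\, P_\tau \beta_\tau\bigr] = 0 \tag{8}$$

and

$$Q_\tau'\, \mathrm{vec}\bigl[\Omega_\tau^{-1} - S(\beta_\tau)\bigr] = 0. \tag{9}$$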
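The matrices $P_\tau$ and $Q_\tau$ are simple 0/1 selection matrices. The following sketch constructs them for a column-stacking $\mathrm{vec}$ operator; the function names, the index conventions (rows index $\tau$, columns index $\mathrm{pa}(\tau)$) and the example patterns are assumptions made for illustration.

```python
import numpy as np

def selection_matrix_P(t, p, free):
    """P with vec(B) = P beta; `free` lists the unconstrained (row, column) entries of B."""
    P = np.zeros((t * p, len(free)))
    for j, (u, v) in enumerate(free):
        P[v * t + u, j] = 1.0          # column-major position of entry (u, v)
    return P

def selection_matrix_Q(t, edges):
    """Q with vec(Omega) = Q omega; omega holds the diagonal, then one entry per edge."""
    params = [(u, u) for u in range(t)] + [tuple(sorted(e)) for e in edges]
    Q = np.zeros((t * t, len(params)))
    for j, (u, v) in enumerate(params):
        Q[v * t + u, j] = 1.0
        Q[u * t + v, j] = 1.0          # symmetric counterpart (same entry if u == v)
    return Q

# Hypothetical block: 2 responses, 2 parents, B constrained to be diagonal.
P = selection_matrix_P(2, 2, free=[(0, 0), (1, 1)])
beta = np.array([0.5, -1.0])
B = (P @ beta).reshape((2, 2), order="F")

# Hypothetical component with 3 vertices and undirected edges 0 - 1 and 1 - 2.
Q = selection_matrix_Q(3, edges=[(0, 1), (1, 2)])
omega = np.array([2.0, 2.0, 2.0, -0.5, -0.3])
Omega = (Q @ omega).reshape((3, 3), order="F")
```

With these conventions, `reshape(..., order="F")` is the inverse of the $\mathrm{vec}$ operator.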
Equation (9) represents in a compact way the fact that the covariance associated with an
undirected edge in the AMP chain graph is equal to its counterpart in $S(\beta_\tau)$, that is, it is
equal to the empirical covariance of residuals computed for fixed $\beta_\tau$. Thus, equation (9)
parallels the well-known likelihood equations of undirected Gaussian graphical models.



3.2 Two-step estimation 

If every vertex in $\mathrm{pa}(\tau)$ is adjacent to all vertices in $\tau$, then no constraints on $B_\tau$ are
imposed and $P_\tau$ becomes an identity matrix. In this case the first set of equations leads
to the usual least squares estimator

$$\hat B_\tau = S_{\tau,\mathrm{pa}(\tau)}\, S_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)}^{-1}. \tag{10}$$

Thus the MLE of $(\beta_\tau, \omega_\tau)$ can be obtained by fitting an undirected graph model to the
residuals computed using the regression coefficient estimates in (10). This can be done
using iterative proportional fitting (Speed & Kiiveri, 1986; Whittaker, 1990, pp. 182-185),
which generally will terminate in finitely many steps only if the subgraph $G_\tau$
induced by the chain component $\tau$ is decomposable.
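For reference, a minimal sketch of iterative proportional fitting for a Gaussian undirected graph model is given below. It uses the standard clique-wise update of the precision matrix; the function name `ipf`, the stopping rule and the example matrix are assumptions, and the four-cycle is chosen because it is the smallest non-decomposable graph.

```python
import numpy as np

def ipf(S, cliques, tol=1e-12, max_sweeps=1000):
    """Fit the precision matrix K of a Gaussian undirected graph model to S.

    Each update makes the fitted covariance agree with S on one clique c:
    K_cc <- inv(S_cc) + K_cr inv(K_rr) K_rc, where r is the complement of c.
    Entries of K outside all cliques stay at zero.
    """
    d = S.shape[0]
    K = np.eye(d)
    for _ in range(max_sweeps):
        K_old = K.copy()
        for c in map(list, cliques):
            r = [i for i in range(d) if i not in c]
            new = np.linalg.inv(S[np.ix_(c, c)])
            if r:
                K_cr = K[np.ix_(c, r)]
                new = new + K_cr @ np.linalg.solve(K[np.ix_(r, r)], K_cr.T)
            K[np.ix_(c, c)] = new
        if np.max(np.abs(K - K_old)) < tol:
            break
    return K

# Hypothetical covariance matrix; the graph is the four-cycle 0 - 1 - 2 - 3 - 0.
S = np.array([[1.0, 0.4, 0.2, 0.3],
              [0.4, 1.0, 0.3, 0.1],
              [0.2, 0.3, 1.0, 0.4],
              [0.3, 0.1, 0.4, 1.0]])
cliques = [(0, 1), (1, 2), (2, 3), (0, 3)]
K = ipf(S, cliques)
fitted = np.linalg.inv(K)
```

At convergence the fitted covariance matches $S$ on the diagonal and on every edge, while the precision matrix is zero for every non-adjacent pair, which is exactly the content of the likelihood equations (9).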

In the case of general AMP chain graphs with constraints on $B_\tau$, a similar two-step
method can also be used for parameter estimation, as described in Edwards (2000, §7.5):

1. estimate $\beta_\tau$ by least squares by regressing each $X_v$, $v \in \tau$, on its parents $X_{\mathrm{pa}(v)}$;

2. estimate $\omega_\tau$ by fitting an undirected graph model to the regression residuals.

For general AMP chain graphs with restrictions on $B_\tau$, however, the two equations (8)
and (9) for $\beta_\tau$ and $\omega_\tau$ cannot be solved separately and the MLE differs from this two-step
estimator.
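Step 1 of the two-step method can be sketched as follows, working from the blocks of the sample covariance matrix. The function name `rowwise_ols`, the parent sets and the simulated data are assumptions for illustration; step 2 would pass the resulting residual covariance to an iterative proportional fitting routine.

```python
import numpy as np

def rowwise_ols(S_tp, S_pp, parents):
    """Regress each response u on its own parents; parents[u] lists column indices."""
    t, p = S_tp.shape
    B = np.zeros((t, p))
    for u, pa in enumerate(parents):
        if pa:
            # Normal equations: S_pp[pa, pa] b = S_tp[u, pa]
            B[u, pa] = np.linalg.solve(S_pp[np.ix_(pa, pa)], S_tp[u, pa])
    return B

# Hypothetical data: 2 responses, 3 candidate parents, n = 200 observations.
rng = np.random.default_rng(0)
n = 200
X_p = rng.standard_normal((3, n))
B_true = np.array([[1.0, 0.0, 0.5], [0.0, -1.0, 0.0]])
X_t = B_true @ X_p + rng.standard_normal((2, n))
S_tp, S_pp = X_t @ X_p.T / n, X_p @ X_p.T / n

B_constrained = rowwise_ols(S_tp, S_pp, parents=[[0, 2], [1]])
B_full = rowwise_ols(S_tp, S_pp, parents=[[0, 1, 2], [0, 1, 2]])
```

When every response has all vertices of $\mathrm{pa}(\tau)$ as parents, `B_full` reduces to the unconstrained least squares estimator (10).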



3.3 Algorithm for maximum likelihood estimation 

To compute the MLE, or rather a solution to the likelihood equations, in the general
case, we consider an iterative method based on alternately maximizing the likelihood
with respect to $\beta_\tau$ and $\omega_\tau$. Let $(\tilde\beta_\tau, \tilde\omega_\tau)$ be a consistent estimator. Then setting
$\omega_\tau^{(0)} = \tilde\omega_\tau$ we define the sequence of estimators

$$\beta_\tau^{(k+1)} = \arg\max_{\beta_\tau} \ell_n(\beta_\tau, \omega_\tau^{(k)}) \tag{11}$$

and

$$\omega_\tau^{(k+1)} = \arg\max_{\omega_\tau} \ell_n(\beta_\tau^{(k+1)}, \omega_\tau) \tag{12}$$

for $k \ge 0$. Note that $\beta_\tau^{(k+1)}$ can be computed in an explicit formula as the solution to
(8) with $\Omega_\tau$ substituted by $\Omega_\tau(\omega_\tau^{(k)})$, which is

$$\beta_\tau^{(k+1)} = \bigl\{P_\tau' \bigl[S_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)} \otimes \Omega_\tau(\omega_\tau^{(k)})\bigr] P_\tau\bigr\}^{-1} P_\tau'\, \mathrm{vec}\bigl[\Omega_\tau(\omega_\tau^{(k)})\, S_{\tau,\mathrm{pa}(\tau)}\bigr]. \tag{13}$$

Similarly, $\omega_\tau^{(k+1)}$ can be computed as the solution to (9) with $\beta_\tau$ substituted by $\beta_\tau^{(k+1)}$.
The equations in (9) then correspond to the likelihood equations of an undirected graph
model for the undirected induced subgraph $G_\tau$ and the regression residuals as data. In
other words the undirected graph model for $G_\tau$ has to be fitted to the sample covariance
matrix $S(\beta_\tau^{(k+1)})$, for which the iterative proportional fitting algorithm can be used.
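The alternation between the generalized least squares step (13) and iterative proportional fitting can be sketched as below for a single chain component. This is an illustrative implementation under stated assumptions, not the authors' code: the function names, the identity starting value for $\Omega_\tau$, the stopping rule and the simulated data are all choices made for the example.

```python
import numpy as np

def ipf(S, cliques, tol=1e-12, max_sweeps=1000):
    """Iterative proportional fitting; returns the fitted precision matrix."""
    d = S.shape[0]
    K = np.eye(d)
    for _ in range(max_sweeps):
        K_old = K.copy()
        for c in map(list, cliques):
            r = [i for i in range(d) if i not in c]
            new = np.linalg.inv(S[np.ix_(c, c)])
            if r:
                K_cr = K[np.ix_(c, r)]
                new = new + K_cr @ np.linalg.solve(K[np.ix_(r, r)], K_cr.T)
            K[np.ix_(c, c)] = new
        if np.max(np.abs(K - K_old)) < tol:
            break
    return K

def residual_cov(S_tt, S_tp, S_pp, B):
    return S_tt - B @ S_tp.T - S_tp @ B.T + B @ S_pp @ B.T

def gls_step(S_tp, S_pp, Omega, P):
    """Generalized least squares update (13) for beta, given Omega."""
    lhs = P.T @ np.kron(S_pp, Omega) @ P
    rhs = P.T @ (Omega @ S_tp).reshape(-1, order="F")
    return np.linalg.solve(lhs, rhs)

def fit_block(S_tt, S_tp, S_pp, free, cliques, tol=1e-10, max_iter=500):
    """Alternate (11) and (12) until the parameters stop changing."""
    t, p = S_tp.shape
    P = np.zeros((t * p, len(free)))
    for j, (u, v) in enumerate(free):
        P[v * t + u, j] = 1.0
    Omega, B = np.eye(t), np.zeros((t, p))
    for _ in range(max_iter):
        beta = gls_step(S_tp, S_pp, Omega, P)
        B_new = (P @ beta).reshape((t, p), order="F")
        Omega_new = ipf(residual_cov(S_tt, S_tp, S_pp, B_new), cliques)
        delta = max(np.max(np.abs(B_new - B)), np.max(np.abs(Omega_new - Omega)))
        B, Omega = B_new, Omega_new
        if delta < tol:
            break
    return B, Omega, P

# Hypothetical block: 3 responses with undirected edges 0 - 1 - 2, and 2 parents.
rng = np.random.default_rng(1)
n = 400
X_p = rng.standard_normal((2, n))
B_true = np.array([[0.8, 0.0], [0.5, -0.4], [0.0, 0.6]])
Omega_true = np.array([[2.0, -0.6, 0.0], [-0.6, 2.0, 0.5], [0.0, 0.5, 1.5]])
noise = np.linalg.cholesky(np.linalg.inv(Omega_true)) @ rng.standard_normal((3, n))
X_t = B_true @ X_p + noise
S_tt, S_tp, S_pp = X_t @ X_t.T / n, X_t @ X_p.T / n, X_p @ X_p.T / n
free = [(0, 0), (1, 0), (1, 1), (2, 1)]   # unconstrained entries of B
cliques = [(0, 1), (1, 2)]
B_hat, Omega_hat, P = fit_block(S_tt, S_tp, S_pp, free, cliques)
```

At the returned values both sets of likelihood equations, (8) and (9), hold up to numerical tolerance, and the zero patterns of $B_\tau$ and $\Omega_\tau$ are preserved throughout.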

The convergence properties of the sequence $(\beta_\tau^{(k)}, \omega_\tau^{(k)})_{k \in \mathbb{N}}$ are discussed in the
appendix. In particular, it follows that the sequence converges if there are only finitely
many solutions to the likelihood equations (8) and (9). Note that the likelihood equations
may indeed have multiple solutions; compare Drton & Richardson (2004) and Drton, who
consider seemingly unrelated regressions that are special cases of the block-regressions
encountered here.

3.4 The Fisher-information 

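For $\tau \in \mathcal{T}$ the second derivatives of the log-likelihood function are

$$\frac{\partial^2 \ell_n(\beta_\tau, \omega_\tau)}{\partial \beta_\tau\, \partial \beta_\tau'} = -P_\tau' \bigl[S_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)} \otimes \Omega_\tau\bigr] P_\tau,$$

$$\frac{\partial^2 \ell_n(\beta_\tau, \omega_\tau)}{\partial \omega_\tau\, \partial \omega_\tau'} = -\frac{1}{2}\, Q_\tau' \bigl[\Omega_\tau^{-1} \otimes \Omega_\tau^{-1}\bigr] Q_\tau,$$

and

$$\frac{\partial^2 \ell_n(\beta_\tau, \omega_\tau)}{\partial \beta_\tau\, \partial \omega_\tau'} = P_\tau' \bigl[\bigl(S_{\mathrm{pa}(\tau),\tau} - S_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)} B_\tau(\beta_\tau)'\bigr) \otimes I_\tau\bigr] Q_\tau.$$

Let $\theta = (\theta_\tau)_{\tau \in \mathcal{T}} = (\beta_\tau, \omega_\tau)_{\tau \in \mathcal{T}} \in \Theta$ and let $\Sigma$ be the associated covariance matrix given
by equation (3). Then the Fisher-information $\mathcal{I}(\theta)$ for the Gaussian AMP chain graph
model $\mathcal{P}(G)$ is block-diagonal and the $\tau \times \tau$-block is equal to

$$\mathcal{I}(\theta)_{\tau,\tau} = \begin{pmatrix} P_\tau' (\Sigma_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)} \otimes \Omega_\tau) P_\tau & 0 \\ 0 & \tfrac{1}{2}\, Q_\tau' (\Omega_\tau^{-1} \otimes \Omega_\tau^{-1}) Q_\tau \end{pmatrix}. \tag{14}$$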
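The Fisher-information block (14) and the resulting standard errors can be computed as in the sketch below. The function names are assumptions, and the scalar example (one response, one parent) is chosen so that the result can be checked by hand: with $\Sigma_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)} = 3$ and $\Omega_\tau = 2$ the block is $\mathrm{diag}(6, 1/8)$.

```python
import numpy as np

def fisher_block(Sigma_pp, Omega, P, Q):
    """Block of the Fisher information for one chain component, as in (14)."""
    Omega_inv = np.linalg.inv(Omega)
    I_beta = P.T @ np.kron(Sigma_pp, Omega) @ P
    I_omega = 0.5 * Q.T @ np.kron(Omega_inv, Omega_inv) @ Q
    top = np.hstack([I_beta, np.zeros((I_beta.shape[0], I_omega.shape[1]))])
    bottom = np.hstack([np.zeros((I_omega.shape[0], I_beta.shape[1])), I_omega])
    return np.vstack([top, bottom])

def standard_errors(info, n):
    """Asymptotic standard errors: square roots of diag(inverse information) / n."""
    return np.sqrt(np.diag(np.linalg.inv(info)) / n)

# Scalar illustration: one response, one parent, no constraints.
Sigma_pp = np.array([[3.0]])
Omega = np.array([[2.0]])
P = np.array([[1.0]])
Q = np.array([[1.0]])
info = fisher_block(Sigma_pp, Omega, P, Q)
se = standard_errors(info, n=100)
```

In practice $\Sigma_{\mathrm{pa}(\tau),\mathrm{pa}(\tau)}$ and $\Omega_\tau$ are replaced by their estimates before the standard errors are evaluated.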



3.5 Consistency and asymptotic normality

In the following, let $\hat\theta_{\tau,n} = (\hat\beta_{\tau,n}, \hat\omega_{\tau,n})$ be the limit of the sequence
$(\beta_\tau^{(k)}, \omega_\tau^{(k)})_{k \in \mathbb{N}}$ for sample size $n$ and let $\hat\theta_n = (\hat\theta_{\tau,n})_{\tau \in \mathcal{T}}$. Should such a limit not
exist then choose $\hat\theta_{\tau,n}$ as an arbitrary accumulation point. In either situation, all $\hat\theta_{\tau,n}$
are roots to the likelihood equations (8) and (9). This, together with the fact that
Gaussian AMP chain graph models form curved exponential families (theorem 1), leads to
the asymptotic normality stated in theorem 2.

Theorem 1. The Gaussian AMP chain graph model $\mathcal{P}(G)$ is a curved exponential
family.

Proof. The model $\mathcal{P}(G)$ is a subfamily of the regular exponential family of centered
multivariate normal distributions with arbitrary positive definite covariance matrix. The
parameter space $\Theta = \times_{\tau \in \mathcal{T}} \Theta_\tau$ of $\mathcal{P}(G)$ is an open set in a Euclidean space and in
particular a smooth manifold. For $\theta = (\beta_\tau, \omega_\tau)_{\tau \in \mathcal{T}} \in \Theta$, let $B(\theta)$ be the matrix that is
zero except for its $\tau \times \mathrm{pa}(\tau)$-submatrices, $\tau \in \mathcal{T}$, which are equal to $B_\tau(\beta_\tau)$, and let
similarly $\Omega(\theta)$ be the block-diagonal matrix with blocks $\Omega_\tau(\omega_\tau)$, $\tau \in \mathcal{T}$. By equation
(3), the mapping

$$\psi : \theta \mapsto \Sigma(\theta)^{-1} = [I_V - B(\theta)']\, \Omega(\theta)\, [I_V - B(\theta)]$$

maps the parameter $\theta = (\beta_\tau, \omega_\tau)_{\tau \in \mathcal{T}} \in \Theta$ in the parameter space of $\mathcal{P}(G)$ to $\Sigma(\theta)^{-1}$
with $\mathcal{N}(0, \Sigma(\theta)) \in \mathcal{P}(G)$. The inverse map of $\psi$ is determined by the fact that

$$B(\theta)_{v,\mathrm{pa}(v)} = \Sigma(\theta)_{v,\mathrm{pa}(v)} \bigl[\Sigma(\theta)_{\mathrm{pa}(v),\mathrm{pa}(v)}\bigr]^{-1}, \qquad v \in V; \tag{15}$$

compare Richardson & Spirtes (2002, Theorem 8.7). It is now apparent that the mapping
$\psi$ is a diffeomorphism. Therefore, $\psi(\Theta)$ is a smooth manifold, which means that
$\mathcal{P}(G)$ forms a curved exponential family (Kass & Vos, 1997, Definition 2.3.1, 4.2.1).
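The inverse map in (15) can be checked numerically: starting from $(B, \Omega)$, form $\Sigma$ via (3) and recover $B$ row by row, then recover $\Omega$ from $\Sigma^{-1}$. The four-vertex chain graph below (components $\{0, 1\}$ and $\{2, 3\}$ with edges $0 - 1$, $2 - 3$, $0 \to 2$, $1 \to 3$) and its parameter values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical chain graph: 0 - 1, 2 - 3, 0 -> 2, 1 -> 3.
parents = {0: [], 1: [], 2: [0], 3: [1]}
B = np.zeros((4, 4))
B[2, 0], B[3, 1] = 0.7, -0.4
Omega = np.array([[2.0, -0.5, 0.0, 0.0],
                  [-0.5, 2.0, 0.0, 0.0],
                  [0.0, 0.0, 1.5, 0.3],
                  [0.0, 0.0, 0.3, 1.5]])

# Forward map, equation (3).
A = np.linalg.inv(np.eye(4) - B)
Sigma = A @ np.linalg.inv(Omega) @ A.T

# Inverse map, equation (15): regress each vertex on its own parents.
B_rec = np.zeros_like(B)
for v, pa in parents.items():
    if pa:
        B_rec[v, pa] = np.linalg.solve(Sigma[np.ix_(pa, pa)], Sigma[pa, v])

# Omega then follows from Sigma^{-1} = (I - B') Omega (I - B).
A_rec = np.linalg.inv(np.eye(4) - B_rec)
Omega_rec = A_rec.T @ np.linalg.inv(Sigma) @ A_rec
```

Both parameter matrices are recovered exactly up to rounding, which illustrates the identifiability of $\theta$.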



Theorem 2. Let $\theta = (\theta_\tau)_{\tau \in \mathcal{T}}$, $\theta_\tau = (\beta_\tau, \omega_\tau)$, be the true parameter. Then $\hat\theta_n \to \theta$
in probability, the estimates $\hat\theta_{\tau,n}$, $\tau \in \mathcal{T}$, are asymptotically independent, and for each
$\tau \in \mathcal{T}$,

$$\sqrt{n}\, (\hat\theta_{\tau,n} - \theta_\tau) \xrightarrow{d} \mathcal{N}\bigl(0,\, [\mathcal{I}(\theta)_{\tau,\tau}]^{-1}\bigr)$$

with $\mathcal{I}(\theta)_{\tau,\tau}$ given in (14).

Proof. The estimators $\hat\theta_n$ are roots to the likelihood equations, computed in iterations
starting at consistent estimates. Theorems 2.4.1, 2.6.1, 2.6.7, and 2.6.12 (see also
Corollaries 2.4.2 and 2.6.2) in Kass & Vos (1997) imply that in one-parameter curved
exponential families such roots to the likelihood equations are consistent and
asymptotically normal with asymptotic variance equal to the inverse of the
Fisher-information. As indicated before the statement of Theorem 4.2.4 in Kass & Vos (1997),
these results extend to multi-parameter families, which yields our claim.
Figure 1: Chain graph with the three chain components {spend, strat, salar},
{top10, tstsc, rejr, pacc}, and {apgra}.

4 Example: University graduation rates 



We illustrate our maximum likelihood procedure using the data in Druzdzel & Glymour
(1999), which stem from a study for college ranking carried out in 1993. Based on
$n = 159$ universities, Druzdzel & Glymour (1999, Table 3) state a correlation matrix for
eight variables that are

spend average spending per student, 

strat student-teacher ratio, 

salar faculty salary, 

rejr rejection rate, 

pacc percentage of admitted students who accept the university's offer,

tstsc average test scores of incoming students, 

top10 class standing of incoming freshmen, and

apgra average percentage of graduation. 

Figure 1 shows a chain graph for these variables. This graph has the chain components
$\tau_1$ = {spend, strat, salar}, $\tau_2$ = {top10, tstsc, rejr, pacc}, and $\tau_3$ = {apgra}.
It was selected via the SIN model selection procedure described in Drton & Perlman
(2004a,b). More precisely, we used SIN model selection with simultaneous significance
level 0.15 fixing the chain components $\tau_1$, $\tau_2$, and $\tau_3$ a priori in the temporal order
$\tau_1 < \tau_2 < \tau_3$. In the resulting AMP chain graph we deleted the undirected edge between
top10 and rejr, and introduced the undirected edge between top10 and pacc, creating a
non-decomposable chain component $\tau_2$. Furthermore, we deleted the edge between salar
and top10 to create the edge constellation

strat $\to$ top10 $-$ tstsc $\leftarrow$ salar.

The induced subgraph over the four vertices salar, strat, top10 and tstsc, which also
contains the edge salar $-$ strat, forms what is called a 2-biflag by Andersson et al.
(2001); compare their figure 5(d). Therefore, by theorem 4 in Andersson et al. (2001),
the AMP and LWF Markov properties differ for the graph in figure 1.

The block-regression for $\tau_1$ is trivial as $\mathrm{pa}(\tau_1) = \emptyset$ and the undirected induced
subgraph is complete, and thus the MLE of $\Omega_{\tau_1}$ is simply the inverse of $S_{\tau_1,\tau_1}$. The
block-regression for $\tau_3$ is also simple as $\tau_3$ contains only a single vertex. In this case, the
MLE of $\beta_{\tau_3}$ and $\omega_{\tau_3}$ can be computed by regressing the single variable in $\tau_3$, here the
variable apgra, on all its parents, here the variables pacc, salar, and tstsc. The vector of
least squares estimates of the regression coefficients is the MLE of $\beta_{\tau_3}$ and the inverse
of the estimated conditional variance is the MLE of $\omega_{\tau_3}$.

The remaining block-regression for $\tau_2$ is non-trivial. We apply the ML estimation
algorithm described in section 3.3 starting with the identity matrix as initial estimate of
$\Omega_{\tau_2}$ and iterating until convergence to find the estimates stated in the columns labelled
"MLE" in table 1. Note that we cannot guarantee that these estimates constitute the
global maximum of the likelihood function. However, using these estimates to evaluate
the deviance of the AMP chain graph model yields a value of 16.89, which compared to
11 degrees of freedom indicates a reasonable fit.
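The degrees of freedom and the size of the deviance can be put in perspective with a short calculation. The sketch below counts parameters from the description of the three block-regressions and evaluates chi-square tail probabilities with a series expansion of the incomplete gamma function; the parameter count is our own consistency check, and the helper `chi2_sf` is an illustrative implementation rather than a library routine.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    a, z = df / 2.0, x / 2.0
    # Series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-15 * total:
        k += 1
        term *= z / (a + k)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower

# Saturated model: 8 variables, 8 * 9 / 2 free covariance parameters.
saturated = 8 * 9 // 2
# tau_1: complete graph on 3 vertices (6 entries of Omega);
# tau_2: 7 regression coefficients and 8 entries of Omega (see table 1);
# tau_3: 3 regression coefficients and 1 conditional precision.
model = 6 + (7 + 8) + (3 + 1)
df = saturated - model

p_mle = chi2_sf(16.89, df)       # deviance at the ML estimates
p_two_step = chi2_sf(19.18, df)  # deviance at the two-step estimates
```

The count reproduces the 11 degrees of freedom quoted in the text, and the tail probability for the deviance of 16.89 lies above the conventional 0.05 level.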

Table 1 also states the two-step estimates obtained as described in section 3.2. These
estimates coincide with the estimates after two steps of the ML estimation algorithm,
provided the algorithm is started at a diagonal matrix. The two steps of the ML
estimation algorithm consist of one step estimating $\beta_\tau$ assuming a diagonal matrix $\Omega_\tau$,
i.e. assuming independence of all variables in the chain component $\tau$, and one step
estimating $\omega_\tau$ using the newly found estimate of $\beta_\tau$. The two-step estimates are fairly
close to the MLEs, all differences being clearly smaller than two standard errors. The
deviance based on the two-step estimates would be 19.18. Interestingly, the two-step
estimates and the MLEs for the variance parameters $\omega_\tau$ are identical in two digits of
precision with the exception of the conditional variances $\omega_{top10}$, $\omega_{tstsc}$ and the inverse
Table 1: MLEs, their standard errors (SE) computed from the Fisher information matrix, and
the two-step estimates for the block-regression for chain component $\tau_2$.

Parameter                            MLE     SE   2-step    Parameter                  MLE     SE   2-step
$\beta_{pacc \leftarrow salar}$     -0.53   0.07   -0.52    $\omega_{pacc}$            1.46   0.16    1.46
$\beta_{rejr \leftarrow salar}$      0.26   0.09    0.30    $\omega_{rejr}$            1.64   0.18    1.64
$\beta_{rejr \leftarrow spend}$      0.30   0.09    0.27    $\omega_{top10}$           2.99   0.33    2.92
$\beta_{top10 \leftarrow spend}$     0.98   0.08    0.99    $\omega_{tstsc}$           3.39   0.37    3.34
$\beta_{top10 \leftarrow strat}$     0.44   0.07    0.45    $\omega_{pacc,rejr}$      -0.33   0.12   -0.33
$\beta_{tstsc \leftarrow salar}$     0.26   0.06    0.36    $\omega_{pacc,top10}$     -0.16   0.14   -0.16
$\beta_{tstsc \leftarrow spend}$     0.49   0.07    0.43    $\omega_{rejr,tstsc}$     -0.65   0.16   -0.65
                                                            $\omega_{top10,tstsc}$    -1.76   0.28   -1.69

covariance $\omega_{top10,tstsc}$ that all involve the variables top10 and tstsc that are part of the
biflag.

5 Discussion 

The likelihood function of a Gaussian AMP chain graph model can be factored into 
the product of conditional likelihood functions. Each chain component of the graph 
gives rise to one factor in this factorization. The iterative algorithm we proposed for 
ML estimation in Gaussian AMP chain graph models takes advantage of this fact and 
treats each chain component separately. For a given chain component, the algorithm 
alternates between estimating regression coefficients while fixing a covariance matrix 
and estimating the (restricted) covariance matrix while fixing regression coefficients. To 
perform the former task of estimating regression coefficients we use a generalized least 
squares formula, whereas the iterative proportional fitting algorithm is used to perform 
the latter task of estimating a covariance matrix. 

The algorithm calls upon repeated runs of iterative proportional fitting in order to fit
the block-regression model associated with a given chain component. This is in contrast
to the case of LWF chain graph models, for which the ML estimates of the parameters
associated with a chain component can be computed by running iterative proportional
fitting only once (Lauritzen, 1996, §5.4.1, Proposition 6.33). However, the undirected
graph on which iterative proportional fitting is run must be derived from the original
LWF chain graph in a process called moralization. In general, this derived undirected
graph also contains vertices outside the considered chain component and may feature
larger cliques than the undirected subgraph induced by the chain component, on which
iterative proportional fitting is run when fitting AMP chain graph models.

The developed methodology for ML fitting of AMP chain graph models permits, in
particular, the comparison of two models based on different chain graphs via likelihood
ratio tests and information criteria. However, one may also be interested in testing
parameter equality in a given model. If parameters are set equal in a curved exponential
family, then the resulting submodel is again a curved exponential family. Therefore, the ML
estimates in the submodel are asymptotically normal, and the problem of testing
parameter equality can be addressed by a likelihood ratio test. For the computation of ML
estimates in such submodels, the algorithm we proposed for fitting AMP chain graph
models needs to be extended to incorporate equality constraints amongst subsets of the
parameters. If parameter equality occurs between regression coefficients that appear
in the same matrix $B_\tau$, then the generalized least squares step of the algorithm can
easily be adapted to deal with this new situation. The required changes consist solely
of removing all but one of the identical entries of the vector $\beta_\tau$ and altering the matrix
$P_\tau$ accordingly. With these changes, formula (13) still applies. If parameter equality
occurs between entries of the matrix $\Omega_\tau$ then the iterative proportional fitting step of the
algorithm has to be adapted. This can be done as described in Højsgaard & Lauritzen,
who treat parameter equality in undirected graphical models. Finally, if parameter
equality occurs between parameters appearing in different matrices $B_\tau$ and $B_{\tau'}$, or
$\Omega_\tau$ and $\Omega_{\tau'}$, then the block-regressions can no longer be treated separately. In this case
the extension of the presented algorithm requires additional work.

Appendix: Iterative partial maximization 

The algorithm for ML estimation proposed in this paper is an iterative partial
maximization algorithm in the sense of Lauritzen (1996, Appendix A.4). Partial maximization
refers to a maximization of the likelihood function over a section in the parameter
space. In an iterative partial maximization algorithm, one repeatedly performs a
sequence of partial maximizations. In this appendix, we generalize the convergence results
in Lauritzen (1996, Appendix A.4) by not assuming the existence of a unique local (and
global) maximum of the likelihood function.

Let $L : \Theta \to \mathbb{R}$ be a differentiable real-valued function on an open set $\Theta \subseteq \mathbb{R}^d$. In
the context of ML estimation, $L$ constitutes the (log-)likelihood function and $\Theta$ is the
parameter space of a statistical model. Assume that there exists $\theta_0 \in \Theta$ such that
$\Theta_0 = \{\theta \in \Theta \mid L(\theta) \ge L(\theta_0)\}$ is compact. Then $L$ has a (not necessarily unique) global
maximum in $\Theta_0$. For functions $g_i : \Theta \to \mathbb{R}^{d_i}$, $i = 1, \ldots, k$, and $\theta^* \in \Theta$, we define sections
$\Theta_i(\theta^*)$ in $\Theta$ by

$$\Theta_i(\theta^*) = \{\theta \in \Theta \mid g_i(\theta) = g_i(\theta^*)\}.$$

We assume that the maximum of $L$ over the section $\Theta_i(\theta^*)$ is uniquely attained for all
$\theta^* \in \Theta$ and $i = 1, \ldots, k$ and that the associated mapping

$$T_i(\theta^*) = \arg\max_{\theta \in \Theta_i(\theta^*)} L(\theta)$$

from $\Theta$ into itself is continuous for all $i = 1, \ldots, k$. Moreover, we assume that if $\theta^*$
maximizes $L$ over all sections $\Theta_i(\theta^*)$, and consequently satisfies $\theta^* = T_i(\theta^*)$ for
$i = 1, \ldots, k$, then $\theta^*$ solves the likelihood equations

$$\frac{\partial L(\theta)}{\partial \theta} = 0. \tag{16}$$

Let $\theta_0 \in \Theta$ be a starting value such that $\Theta_0$ is compact and define

$$\theta_{n+1} = S(\theta_n) = T_k \cdots T_1(\theta_n), \qquad n \ge 0.$$

By definition of $\Theta_0$, we have $\theta_n \in \Theta_0$ for all $n \ge 0$. Let $A_\infty$ be the set of accumulation
points of the sequence $(\theta_n)_{n \in \mathbb{N}}$. Since $\Theta_0$ is compact, we have $\emptyset \neq A_\infty \subseteq \Theta_0$. The following
results discuss the properties of $A_\infty$. The first of them, Proposition 1, is a special case
of the convergence theorem in Zangwill (1969, Chapter 4).
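The setup above can be made concrete with a toy example. The sketch below applies iterative partial maximization to a strictly concave quadratic in two variables, where each section fixes one coordinate; the function $L$, the maps `T1` and `T2`, and the starting value are assumptions chosen so that every partial maximizer has a closed form.

```python
def L(x, y):
    """Toy objective, strictly concave with unique maximum at (1/3, 1/3)."""
    return -(x * x + x * y + y * y) + x + y

def T1(x, y):
    """Maximize L over x with y held fixed (section g_1(theta) = y)."""
    return (1.0 - y) / 2.0, y

def T2(x, y):
    """Maximize L over y with x held fixed (section g_2(theta) = x)."""
    return x, (1.0 - x) / 2.0

theta = (5.0, -3.0)          # starting value theta_0
values = [L(*theta)]
for _ in range(60):
    theta = T2(*T1(*theta))  # one sweep: theta_{n+1} = T_2 T_1 (theta_n)
    values.append(L(*theta))
```

The recorded values of $L$ never decrease from one sweep to the next, and the iterates approach the unique stationary point, which here is also the only accumulation point.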

Proposition 1. The sequence $(L(\theta_n))_{n \in \mathbb{N}}$ of values of the likelihood function converges
to a limit $\ell_\infty \in \mathbb{R}$. Furthermore, if $a \in A_\infty$ then $L(a) = \ell_\infty$ and $a$ satisfies (16).

Proof. Since the sequence $(L(\theta_n))_{n \in \mathbb{N}}$ is monotonically increasing and bounded, it
converges to a limit $\ell_\infty$. By continuity of $L$, this also implies $L(a) = \ell_\infty$ for all $a \in A_\infty$.

Next, since $S = T_k \cdots T_1$ is continuous, $S(A_\infty)$ is the set of accumulation points of
$(S(\theta_n)) = (\theta_{n+1})$. Consequently $S(A_\infty) = A_\infty$ and $L(S(a)) = \ell_\infty$ for all $a \in A_\infty$. By
definition of $T_i$, we now obtain for arbitrary $a \in A_\infty$,

$$\ell_\infty = L(T_k \cdots T_1(a)) \ge L(T_{k-1} \cdots T_1(a)) \ge \cdots \ge L(T_1(a)) \ge L(a) = \ell_\infty,$$

which implies $T_i(a) = a$ for all $i = 1, \ldots, k$ because of uniqueness of the maximum
over $\Theta_i(a)$. Thus $a$ maximizes $L$ over all sections and hence satisfies equations (16) by
assumption.

For the next theorem, recall that a compact set is said to be connected if it cannot be
partitioned into two nonempty compact sets (see also Ostrowski, 1966, Theorem 28.1).
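Theorem 3. $A_\infty$ is a compact and connected subset of $\Theta_0$.

Proof. Since $A_\infty$ is a subset of a compact set, it suffices to show that $A_\infty$ is closed.
Let $a^*$ be a point in the closure of $A_\infty$. Then for any $\varepsilon > 0$ there exists $a \in A_\infty$ such that
$a \in B_\varepsilon(a^*)$. Similarly, since $a$ is an accumulation point of $(\theta_n)$, there exists for every
$\delta > 0$ some $n_\delta \in \mathbb{N}$ such that $\theta_{n_\delta} \in B_\delta(a)$. Since $B_\varepsilon(a^*)$ is open, we can choose $\delta$ small
enough such that $B_\delta(a) \subseteq B_\varepsilon(a^*)$, which implies $\theta_{n_\delta} \in B_\varepsilon(a^*)$. Since $\varepsilon$ was arbitrary,
$a^* \in A_\infty$, which establishes the closedness of $A_\infty$.

Next, let $B_\varepsilon(A_\infty) = \{\theta \in \Theta \mid d(\theta, A_\infty) < \varepsilon\}$ where $d(A, B)$ is the distance between
two subsets $A$ and $B$ in $\mathbb{R}^d$. Then for every $\varepsilon > 0$ there exists $n_\varepsilon \in \mathbb{N}$ such that
$\theta_n \in B_\varepsilon(A_\infty)$ for all $n \ge n_\varepsilon$.

Now suppose that $A_\infty$ can be partitioned into two nonempty compact sets $A$ and $B$. Then
$d(A, B) > 0$ and we set $\delta = d(A, B)/2$. Furthermore, because of uniform continuity of
$S$ on $\Theta_0$, for all $\delta > 0$ there exists $\varepsilon' > 0$ such that for all $a \in A_\infty$, $\theta_n \in B_{\varepsilon'}(a)$ implies
$\theta_{n+1} = S(\theta_n) \in B_\delta(a)$; recall that $S(a) = a$ for $a \in A_\infty$ by the proof of Proposition 1.

Then if $\varepsilon \le \min(\varepsilon', \delta)$, $n \ge n_\varepsilon$ and $\theta_n \in B_\varepsilon(A)$, we have

$$\theta_{n+1} \in B_\delta(A) \cap B_\varepsilon(A_\infty) = B_\varepsilon(A),$$

since $d(A, B) > \delta$. Thus $\theta_n \in B_\varepsilon(A)$ for all $n \ge n_\varepsilon$ and hence $B = \emptyset$, which concludes
the proof.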

Corollary 1. If $A_\infty$ is finite, then $A_\infty = \{\theta^*\}$ for some $\theta^* \in \Theta_0$ and the sequence
$(\theta_n)_{n \in \mathbb{N}}$ converges to $\theta^*$.

Proof. Any connected finite set must be a singleton.

Corollary 2. If the likelihood equations have only finitely many solutions that lie
on the same contour of the likelihood function $L$, then the sequence $(\theta_n)_{n \in \mathbb{N}}$ converges
to one solution $\theta^*$.

Proof. This follows from Proposition 1 and Corollary 1.